# packages that we need for the notebook
require(preprocessCore)
require(sva)
require(pcaMethods)
require(gplots)
require(e1071)
source("black_box_svm.R")

Before starting

Introduction

Good models in machine learning require high quality data. Otherwise happens what we call garbage in, garbage out. Furthermore, prediction models need standardized data; otherwise the trained model will only work for the same particular set of data that was used for training. For these reasons, the first steps in machine learning are composed of data exploration and quality control; and are crucial to ensure that all the downstream tasks are not doomed from the start.

In this notebook we will be using a transcriptomics dataset of ColoRectal Cancer (CRC). CRC has four different subtypes (CMS1, CMS2, CMS3 and CMS4) that it can be classified to (you can read more about it here). Our objective is to train a machine learning model to classify a CRC sample into the four subtypes. For example, this would be useful if different subtypes receive different treatments, and we want to make sure that a new patient gets the best treatment possible.

We will go through a checklist of the main tasks to do before embarking in any machine learning process. This includes:

Let’s think about two characteristics about the data:

The dataset

A little secret

  • Although we are using the core data from the paper, it has been artificially modified to include examples of some problems that we might encounter in reality.

Loading the dataset:

sample_data <- read.table(file = 'rawdata/samples_data.txt', header = TRUE, sep = '\t')
expr_data <- read.table(file = 'rawdata/expr_data.txt', header = TRUE, sep = '\t')
gene_data <- read.table(file = 'rawdata/gene_data.txt', header = TRUE, sep = '\t')

We have just read 3 tables that contain the following information:

The dimensions of expr_data tell us about the amount of genes and samples.

dim(expr_data)
## [1] 20742    56

We have 56 samples and 20742 genes.

We can take a look at the sample_data data.frame to see how does our experiment look like.

sample_data
##    SampleId sample_name batch  CMS
## 1         1   sample_14     1 CMS1
## 2         2   sample_17     1 CMS1
## 3         3   sample_68     1 CMS1
## 4         4   sample_71     1 CMS1
## 5         5   sample_67     1 CMS2
## 6         6   sample_69     1 CMS2
## 7         7   sample_15     1 CMS2
## 8         8   sample_72     1 CMS2
## 9         9  sample_107     1 CMS1
## 10       10   sample_70     1 CMS2
## 11       11   sample_30     1 CMS3
## 12       12   sample_29     1 CMS3
## 13       13    sample_4     1 CMS3
## 14       14   sample_31     1 CMS4
## 15       15   sample_12     1 CMS4
## 16       16   sample_33     1 CMS4
## 17       17    sample_2     1 CMS4
## 18       18   sample_16     1 CMS4
## 19       19   sample_34     1 CMS4
## 20       20   sample_14     2 CMS1
## 21       21   sample_17     2 CMS1
## 22       22   sample_68     2 CMS1
## 23       23   sample_71     2 CMS1
## 24       24   sample_67     2 CMS2
## 25       25   sample_69     2 CMS2
## 26       26  sample_108     2 CMS3
## 27       27   sample_15     2 CMS2
## 28       28   sample_72     2 CMS2
## 29       29   sample_70     2 CMS2
## 30       30   sample_30     2 CMS3
## 31       31   sample_29     2 CMS3
## 32       32    sample_4     2 CMS3
## 33       33   sample_31     2 CMS4
## 34       34   sample_12     2 CMS4
## 35       35   sample_33     2 CMS4
## 36       36    sample_2     2 CMS4
## 37       37   sample_16     2 CMS4
## 38       38   sample_34     2 CMS4
## 39       39   sample_14     3 CMS1
## 40       40   sample_17     3 CMS1
## 41       41   sample_68     3 CMS1
## 42       42   sample_71     3 CMS1
## 43       43   sample_67     3 CMS2
## 44       44   sample_69     3 CMS2
## 45       45   sample_15     3 CMS2
## 46       46   sample_72     3 CMS2
## 47       47   sample_70     3 CMS2
## 48       48   sample_30     3 CMS3
## 49       49   sample_29     3 CMS3
## 50       50    sample_4     3 CMS3
## 51       51   sample_31     3 CMS4
## 52       52   sample_12     3 CMS4
## 53       53   sample_33     3 CMS4
## 54       54    sample_2     3 CMS4
## 55       55   sample_16     3 CMS4
## 56       56   sample_34     3 CMS4
str(sample_data)
## 'data.frame':    56 obs. of  4 variables:
##  $ SampleId   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sample_name: Factor w/ 20 levels "sample_107","sample_108",..: 4 7 16 19 15 17 5 20 1 18 ...
##  $ batch      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ CMS        : Factor w/ 4 levels "CMS1","CMS2",..: 1 1 1 1 2 2 2 2 1 2 ...
table(sample_data$CMS)
## 
## CMS1 CMS2 CMS3 CMS4 
##   13   15   10   18

What can you say about the data?

  • There are 56 samples

  • Each sample is done in triplicate.

  • There are 3 batches.

  • Each replicate is in a different batch.

  • Not all CMS subtypes are represented equally.

Data normalization

What is data normalization?

Normalization is a very broad term, it could also include data scaling and batch correction. For this notebook we will keep them separate and define normalization as the process to ensure that all the different samples are comparable between them.

Again, think about the type of data that we are working with and about the design of the experiment. Some examples:

  • Microarray data is composed of intensities, a continuous value, while RNA-seq data is composed of counts, a discrete value. For these two methods, although both measure gene expression, not all normalization approaches will be valid.
  • In proteomics we could measure whole-proteome or measure protein binding to a certain protein that is being pulled down. In the first case we could assume that all samples should have a similar total amount of protein, but in the second case that would not be possible since each different condition could give drastic changes to the amount of bound proteins.

Different types of data -> different approaches for normalization

Different types of experiments -> different approaches for normalization

Just before we start

This is microarray data that has already been background corrected using the RMA method, moreover the data is log2 transformed.

Looking at the data

A good starting point is to plot the different samples using boxplots. This kind of visualization can give as a general overview of what is the data distribution in each sample.

boxplot(expr_data)

This plot already shows some interesting distributions. What do you see?

  • There are batch effects. These kind of batch effects should be already known in advance. It is important to know how the samples were processed and annotate which samples belong to which batch.

  • There are two samples that have a much lower overall gene expression, perhaps something went wrong with them.

  • Overall all most of the samples have a similar distribution with slighly different means.

The weird samples

Because normalization takes into account data from all the samples it is important to make sure that all the samples are actually good. At first glance there seems to be two bad samples. Therefore, we have to make a decision on what to do with them:

  • We keep them.
  • We trash them.

What can we do to explore a bit better these samples? Any ideas on what might be wrong with them?

We can for example look at the amount of missing values, do they just have overall low expression of many genes or is it just that they are not measured at all.

Let’s take a look at the amount of missing values

## sum the amount of missing values in each column (sample)
apply(expr_data, 2, function(x) sum(is.na(x)))
##    X1    X2    X3    X4    X5    X6    X7    X8    X9   X10   X11   X12   X13 
##    27    26    41    25    32    30    24    32 14462    36    26    33    23 
##   X14   X15   X16   X17   X18   X19   X20   X21   X22   X23   X24   X25   X26 
##    29    23    30    24    26    27    28    32    35    17    29    17 13151 
##   X27   X28   X29   X30   X31   X32   X33   X34   X35   X36   X37   X38   X39 
##    33    21    38    25    31    21    32    23    24    19    35    17    39 
##   X40   X41   X42   X43   X44   X45   X46   X47   X48   X49   X50   X51   X52 
##    31    23    34    28    30    26    23    24    27    30    29    27    25 
##   X53   X54   X55   X56 
##    19    18    24    23

We an see that more than 2/3 of the probes have a missing value. That is not a good sign and will definitely affect our normalization statistics. Therefore we can remove these samples.

## select column numbers with more than 10000 missing values
bad_samples <- which(apply(expr_data, 2, function(x) sum(is.na(x))) > 10000)
## remove such samples from our tables
expr_data <- expr_data[,-bad_samples]
sample_data <- sample_data[-bad_samples,]

We can check back at the boxplot to see if we have removed the correct ones.

boxplot(expr_data)

Trashing samples is a simple method. It might be possible or not depending on the amount of samples that one has. Moreover, carefully consider if a bad sample is really due to technical error, perhaps there is some biology behind, and throwing away such samples might add bias to our experiment.

Normalization approaches

Now we can actually normalize the data. There are several methods to do so, some more correct than others depending on the experiment design and the data type. The important concept here is that we want to make the measurements of genes comparable between samples.

For example, if geneA has an expression of 2 in sampleA and an expression of 4 in sampleB; we want to be able to say that sampleB has double the expression because it is real and not because sampleB had double the input of mRNA in the microarray.

And finally, normalization implies that we have some assumptions about the data and the experiment. Different normalization approaches imply different assumptions, and these can help us decide which method fits best for our experiment.

What would be some fair assumptions about this experiment?

  • All the samples should have a comparable total amount of mRNA.

  • All the samples should have similar distributions.

With these assumptions mean substraction, median subsetraction or quantiles normalization would work. Furthermore, it is quite established that for microarray data quantiles normalization works well. Nevertheless we can explore what would these normalizations do to our data distributions.

Mean substraction

sample_mean <- apply(expr_data, 2, mean, na.rm = T)
boxplot(sweep(x = expr_data, MARGIN = 2, STATS = sample_mean, FUN = '-'))

Median substraction

sample_median <- apply(expr_data, 2, median, na.rm = T)
boxplot(sweep(x = expr_data, MARGIN = 2, STATS = sample_median, FUN = '-'))

Quantiles normalization

boxplot(normalize.quantiles(as.matrix(expr_data)))

norm_data <- normalize.quantiles(as.matrix(expr_data))

Questions

  • Let’s say you have done a proteomics pull down experiment on your favorite protein to see which are their binding partners. On one condition you have the WT protein; but in the other you deleted a piece of the protein genetically, and you expect that a significant portion of binding partners will not be able to bind now. How would you normalize such data?

Batch effect removal

Batch effects are changes in the biological data that come from external factors. Different personnel performed the experiment, different machines, different days, different reagents, … These are important to correct, since they can have an important effect on the data, and lead us to the wrong conclusions. For this reason, it is important to check if we see any of these effects on the data. But even more important, is to know if we should expect these effects since we have designed and performed the experiment. And even if we have not done the experiment personally, we should know how it has been done and what is its design.

In the first boxplots that we made, we saw that there is some batch effect going on. We also know which are the samples that have been processed in the same batch from the sample_data data.frame.

After we have normalized the data, it seems like the batch effect is gone. But is it really gone? Other approaches to check for batch effects are unsupervised clustering and dimensionality reduction. We will not go into detail about explaining these since they are not the objective of this notebook. They are actually explained in depth in the PCA and Clustering notebook.

Quick explanation about PCA: each gene is a dimension, to plot all the data in one go we would need a ~25000 dimensional plot (which is of course impossible). To solve that, we join all the genes that behave similarly in one dimension. In the plots, each dot is a sample, similar samples will be together in the 2D space; and different samples will be separated. Keep in mind that these 2 dimensions are a subset of all the data, which means that although two samples might seem far away in a dimension they might be close in another dimension. PCAs are also ordered by the amount of variance that they explain, therefore PC1 will show the most variance, then PC2, PC3…

Quick explanation about unsupervised clustering: we will calculate the distance between a sample and all the others samples. Samples are like points in ~25000 dimensional space, therefore we can calculate their distance like this. We can do this for all the samples and we can see which ones are most similars and which are not. Notice that the formula we are using is for Euclidean distance, there are other distances and each one has its purpouses and assumptions.

Checking by PCA

We can do a quick Principle Components Analysis (PCA) to see how the samples cluster together (similar samples should be together).

What do you expect to see in the PCA?

  • Samples from the same CMS should cluster together.

  • Triplicates within the CMS cluster should be closer together.

We will do a PCA and plot the scores of each sample and color them by CMS.

## calculate the PCA components
pca_res <- pca(t(norm_data))
## plot PC1 and PC2 and color by CMS subtype
plot(pca_res@scores[,1], pca_res@scores[,2], col = sample_data$CMS)

And the same plot colored by batch.

## color by batch
plot(pca_res@scores[,1], pca_res@scores[,2], col = sample_data$batch)

What do these PCA plots show?

  • There is still a clear batch effect since the samples are clustered by batch.

Checking by distance matrix

We can also calculate what is the distance between all the samples to see which ones are most similar and which are most different.

## calculate the distance matrix between samples
distance_matrix <- as.matrix(dist(t(norm_data), method = "euclidean", upper = TRUE, diag = TRUE))
cms_colors <- c('red', 'blue', 'green', 'black')
batch_colors <- c('red', 'blue', 'green')
cms_colors_col <- cms_colors[sample_data$CMS]
batch_colors_col <- cms_colors[sample_data$batch]
heatmap.2(x = distance_matrix, 
          trace = 'none', 
          ColSideColors = cms_colors_col)

The more red the color, the more similar two samples are with each other. The colors at the top display the CMS subtype. Do you see any clusters?

heatmap.2(x = distance_matrix, 
          trace = 'none', 
          ColSideColors = batch_colors_col)

What do these heatmaps show?

  • There is still a clear batch effect since the samples are clustered by batch.

Correcting the batch effect

To correct the batch effect several methods have been implemented. For microarray data particularly, the ComBat method has been shown to work particularly well. But for other types of data, single-cell rna-seq data there are other methods.

Notice that batch correction is a supervised approach, meaning that we need to know a priori which are the different batches.

batch_corr_data <- sva::ComBat(norm_data, batch = sample_data$batch)
## Found3batches
## Adjusting for0covariate(s) or covariate level(s)
## Found1471Missing Data Values
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data

Now try to explore again the batch corrected data and see what happened.

Explore by PCA

pca_res <- pca(t(batch_corr_data))
plot(pca_res@scores[,1], pca_res@scores[,2], col = sample_data$CMS)

plot(pca_res@scores[,1], pca_res@scores[,2], col = sample_data$batch)

  • What has changed?

Explore by distance

distance_matrix <- as.matrix(dist(t(batch_corr_data), upper = TRUE, diag = TRUE))
heatmap.2(x = distance_matrix, 
          trace = 'none', 
          ColSideColors = cms_colors_col)

heatmap.2(x = distance_matrix, 
          trace = 'none', 
          ColSideColors = batch_colors_col)

Questions

  • Why are not the different CMS subtypes not perfectly clustered together even after normalization and batch correction?

Scaling

Scaling is also part of the normalization process. Scaling can be done on the samples and/or on the features.

In our example we already had the data a scaled a bit since we are working with log transformed values. Therefore making genes with really high expression less valuable compared to lower expressed genes. But we might consider scaling all the genes between 0-1, so that we only compare relative expression between samples.

Scaling methods

There are several scaling methods, this article has a nice table with several scaling methods that can be used.

Dealing with missing values

Missing values are quite common and it is important to understand why they exist in our dataset. Again, consider the type of data and the experiment design when thinking about this.

These are the different types:

There are different types of missing values, it is crucial to be able to differentiate between them whenever possible since the methods that are used to deal with each type are different. Sadly, most of the times we have a combination of these different types and it can be difficult to confidently tell which on is which.

In R missing values are usually denoted as NA, but it might be different in other platforms. Consider also if 0 is a missing value or means that it was not truly measured.

Let’s explore how many missing values we have.

sum(is.na(expr_data))
## [1] 1471

And how many non missing values.

sum(!is.na(expr_data))
## [1] 1118597

So there are not so many missing values. Still we want to decide what to do with them. So let’s try to find out which type of missing values are they.

How would you check which kind of missing values are they?

  • See if there are any samples with a lot/very few missing values.

  • See if there are any features with a lot/very few missing values.

  • See in which part of the distribution are the missing values.

Missing values per sample.

apply(expr_data, 2, function(x) sum(is.na(x)))
##  X1  X2  X3  X4  X5  X6  X7  X8 X10 X11 X12 X13 X14 X15 X16 X17 X18 X19 X20 X21 
##  27  26  41  25  32  30  24  32  36  26  33  23  29  23  30  24  26  27  28  32 
## X22 X23 X24 X25 X27 X28 X29 X30 X31 X32 X33 X34 X35 X36 X37 X38 X39 X40 X41 X42 
##  35  17  29  17  33  21  38  25  31  21  32  23  24  19  35  17  39  31  23  34 
## X43 X44 X45 X46 X47 X48 X49 X50 X51 X52 X53 X54 X55 X56 
##  28  30  26  23  24  27  30  29  27  25  19  18  24  23
hist(apply(expr_data, 2, function(x) sum(is.na(x))))

Seems that most samples have between 30-35 missing values. We actually already took care of two samples before that had a lot of missing values. So it doesn’t look like there is anything too worrying here. If all the missing values where in one sample for example, then we might want to further investigate that.

Missing values per gene.

table(apply(expr_data, 1, function(x) sum(is.na(x))))
## 
##     0     1     2    15    18    19    20 
## 20211   472     8     1     6    20    24

Most genes have 0 missing values and then there are some have from 1 to 20.

Notice anything strange about these values?

  • Why such a change from 2 to 18 missing values?.

Perhaps by using a heatmap we can spot what is happening.

We can use colors to tell the different subtypes apart: - Orange: CMS1 - Blue: CMS2 - Green: CMS3 - Yellow: CMS4

colors <- c('CMS1' = "#E69F00", 'CMS2' = "#56B4E9", 
            'CMS3' = "#009E73", 'CMS4' = "#F0E442")
colors_vec <- sapply(sample_data$CMS, function(x) {
  return(colors[as.character(x)])
})
## reduce the plot to only genes that at least 1 missing value
genes_with_nas <- which(apply(expr_data, 1, function(x) sum(is.na(x))) > 0)
reduced_mat <- as.matrix(expr_data[genes_with_nas,])

heatmap.2(reduced_mat, Rowv = NULL, Colv = NULL, 
          dendrogram = 'none', scale = 'none', trace = 'none', 
          ColSideColors = colors_vec, na.color = 'grey30')

What are your first impressions?

  • It looks quite random, although most of the data is at the lower end of the expression distribution. Probably most of these missing values are just “random” because of the limit of detection.

We can focus only on the genes that have between 1 and 5 missing values.

genes_with_nas <- which(apply(expr_data, 1, function(x) sum(is.na(x))) > 0 & 
                          apply(expr_data, 1, function(x) sum(is.na(x))) < 6)
reduced_mat <- as.matrix(expr_data[genes_with_nas,])

heatmap.2(reduced_mat, Rowv = NULL, Colv = NULL,
          dendrogram = 'none', scale = 'none', trace = 'none', 
          ColSideColors = colors_vec, na.color = 'grey30')

What are your first impressions?

  • Very similar to as before, this is expected since most of the genes had less than 3 missing values.

We can check the genes with more than 5 missing values

genes_with_nas <- which(apply(expr_data, 1, function(x) sum(is.na(x))) > 5)
reduced_mat <- as.matrix(expr_data[genes_with_nas,])

heatmap.2(reduced_mat, Rowv = FALSE, Colv = FALSE, 
          dendrogram = 'none', scale = 'none', trace = 'none', 
          ColSideColors = colors_vec, na.color = 'grey30')

What are your first impressions?

  • These are genes that are clear on the low end of expression. Probably missing because of the limit of detection. Besides that it would seem random.
  • There is one gene that stands out! it appears that is detected with high expression in all samples but the ones in CMS2!

So far we have explored our missing values. It looks like the majority of them are random, with emphasis at missing at random but with a detection limit component. Finally there is one gene that has not at random missing values.

Now we have to decide what do we do with these missing values. Again, depending on the type of data and experiment we might want to keep the missing values, remove them or impute them. It is important to decide whether the missingness of a value has information or not.

Some options of what we can do:

What would you do?

Dealing with missing values is quite complex and is extremely dependent on the experiment. In this particular case our final objective is to build a predictor, a predictor needs to be robust, therefore using genes that are sometimes detected and sometimes not is not a good idea. Since most of the values seem random imputation based on the most similar sample would probably work fine. On the other hand, since there are not so many genes that actually have missing values we could also not use them.

We can also try different methods, and then see how the data looks like; specially in this case since we are adding data that will be as important as the real data we want to check that we are not creating any aberrations.

Removing missing genes.

genes_to_remove <- which(apply(expr_data, 1, function(x) sum(is.na(x))) > 0)

reduced_data <- expr_data[-genes_to_remove,]

Setting missing values as the sample mean.

mean_col <- apply(expr_data, 1, mean, rm.na = TRUE)
imputed_data <- expr_data
for (i in seq_len(ncol(imputed_data))) {
  
  col_data <- imputed_data[,i]
  col_data[is.na(col_data)] <- mean_col[i]
  imputed_data[,i] <- col_data
  
}

Setting missing values as a lower quantile value.

quantile(expr_data, na.rm = T)
##        0%       25%       50%       75%      100% 
##  2.053516  3.555102  4.709975  7.330005 17.865068
imputed_data <- expr_data
imputed_data[is.na(imputed_data)] <- quantile(expr_data, na.rm = T)[1]

norm_imputed <- normalize.quantiles(as.matrix(imputed_data))

norm_imputed_batchcorr <- sva::ComBat(norm_imputed, batch = sample_data$batch)
## Found3batches
## Adjusting for0covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data

Questions

  • How can you determine if a sample has too many missing values?
  • When do we impute/remove missing values? Before or after normalization?

Feature selection (pre-ML)

This part will very briefly touch on aleviating the amount of data given to to train/predict. We have measured ~25000 genes, but to predict the different subtypes just a few of them are necessary. In this follow-up paper they built a classifier with 113 probes. Therefore it would reduce the amount of computation time and difficulty for the algorithms if we could a priori reduce the amount of features. There is also the consideration of robustness, reducing the amount of required features is nice, but reducing it too much might make our method too unstable, having some redundancy is also nice.

This part is highly dependent on the machine learning method used, some methods have feature selection already incorporated in the algorithm itself, while some others do not. You can read about it here.

Nevertheless, let’s say that our algorithm does not have feature selection.

IMPORTANT: feature selection is part of ML, therefore we must apply cross-validation methods to ensure that we are not overestimating the performance of our model.

What kind of metrics could we used to reduce the amount of features?

  • Features that have near zero variance: if the expression of a gene does not change across samples it will not be informative.

  • Correlated features: if two or more genes have a highly correlated expression, they contain the same information.

  • Linear dependencies: if the expression of a gene can be calculated by the mathematical combination of two or more other genes.

Zero variance features

We can check how many features have low variance.

hist(apply(batch_corr_data, 1, var, na.rm = T), xlim = c(0, 5), breaks = 100, main = "Variance")

As we can see, many of them have very low variance, these are not very informative.

Does higher variance correlate with better class separation?

  • We actually do not know that. It simply tells us how disperse is a feature.

In this case we could remove genes that have less than a certain variance threshold.

relevant_genes <- which(apply(batch_corr_data, 1, var, na.rm = T) > 0.5)

batch_corr_data <- batch_corr_data[relevant_genes,]

length(relevant_genes)
## [1] 4528

Highly correlated genes

We could remove groups of genes that have high correlation, since in that case they would contain the same information.

This of course requires to set up a threshold on what does high correlation means.

descrCor <-  cor(batch_corr_data, use = 'complete.obs')

hist(descrCor)

Perhaps we could compress all the highly correlated genes into a single meta-gene. This would be essentially PCA, which also includes the variance problem described above in it.

Some quick ML

We can try different combinations of the above described methods (normalization, scaling, dealing with missing values) and see what is their effect on training a model.

To get some reliable values we will do a simple quick cross-validation by setting up a subset of the samples separated. Let’s keep for example the 3rd batch out.

With normalization, imputation and batch correction

x_train <- norm_imputed_batchcorr[, sample_data$batch %in% c(1, 2)]
x_validation <- norm_imputed_batchcorr[, sample_data$batch %in% c(3)]
y_train <- sample_data$CMS[sample_data$batch %in% c(1, 2)]
y_validation <- sample_data$CMS[sample_data$batch %in% c(3)]
model <- build_model(x_train, y_train)
predictions <- predict_new_data(x_validation, model)
performance(y_validation, predictions)
##            labels
## predictions CMS1 CMS2 CMS3 CMS4
##        CMS1    4    0    0    0
##        CMS2    0    5    0    0
##        CMS3    0    0    3    0
##        CMS4    0    0    0    6

Let’s try with other datasets, for example if we use norm_imputed that has not been batch corrected, will it work? You can try other states of the dataset, unfortunately this machine learning method does not deal automatically with missing values, so you will have to take care of them first.

With normalization, imputation and NOT batch correction

x_train <- norm_imputed[, sample_data$batch %in% c(1, 2)]
x_validation <- norm_imputed[, sample_data$batch %in% c(3)]
y_train <- sample_data$CMS[sample_data$batch %in% c(1, 2)]
y_validation <- sample_data$CMS[sample_data$batch %in% c(3)]
model <- build_model(x_train, y_train)
predictions <- predict_new_data(x_validation, model)
performance(y_validation, predictions)
##            labels
## predictions CMS1 CMS2 CMS3 CMS4
##        CMS1    0    0    0    0
##        CMS2    0    0    0    0
##        CMS3    0    0    0    0
##        CMS4    4    5    3    6

With normalization, imputation, batch correction and randomized labels

x_train <- norm_imputed[, sample_data$batch %in% c(1, 2)]
x_validation <- norm_imputed[, sample_data$batch %in% c(3)]
y_train <- sample_data$CMS[sample_data$batch %in% c(1, 2)]
y_validation <- sample_data$CMS[sample_data$batch %in% c(3)]

y_train <- sample(y_train, length(y_train))
model <- build_model(x_train, y_train)
predictions <- predict_new_data(x_validation, model)
performance(y_validation, predictions)
##            labels
## predictions CMS1 CMS2 CMS3 CMS4
##        CMS1    0    0    0    0
##        CMS2    0    1    0    0
##        CMS3    0    0    0    0
##        CMS4    4    4    3    6

Beyond data wrangling

Conclusions

In this notebook we have learned about some of the key aspects of data pre-processing, exploration and cleaning. Keep in mind, that the methods that we have applied here are particularly suited for this kind of data and experiment. It is of extreme importance to research which are the adequate methods for each type of dataset and type of experiment.